Abstract

The evolution in coding DNA sequences brings new flexibility and freedom to the codon 
words, even as the underlying nucleotides get significantly ordered. These curious contra-rules 
of gene organisation are observed from the distribution of words and the second moments of the 
nucleotide letters. These statistical data give us the physics behind the classification of bacteria. 

Over the years the statistical approach to genes has become prominent. Hidden Markov
models are used in the alignment routines of biological sequences [1]. For the secondary structures of
the sequences, stochastic context-free and context-sensitive grammars [2] are applied [3]. The recent
discovery of fractal inverse power-law correlations [4] in these biological chains has led to ideas
that statistically these sequences have features of music and languages [5-7]. Languages evolve with
time. The vocabulary increases; the rules that dominate get progressively optimised, so that the order
and information content increase. The purpose of this work is to track the statistical basis of the
evolution in the coding DNA sequences (CDS).

The CDS are made up of 3-tuples, the codons. The nucleotides adenine (A), cytosine (C), guanine
(G) and thymine (T) taken in groups of three work to build the amino acid chains called proteins. 
The word-structure of CDS is, therefore, well known. We want to study evolution in terms of these 
words, their distributions and the moments. 

It is known that any prose does not carry all the ingredients of evolution of languages. Similarly 
the CDS of any gene does not have all the salient features that accompany change. The genes that 
are present in the whole range of organisms, from the lowest bacteria to the highest mammals, and 
therefore connected to fundamental life processes are normally considered to be best suited to function 
as evolutionary markers. With this in view we choose glyceraldehyde-3-phosphate dehydrogenase 
(GAPDH) CDS for its ubiquitous presence in all living beings. The enzyme it codes for catalyses 
one of the crucial energy-producing steps of glycolysis, the common pathway for both aerobic and 
anaerobic respiration.

Distribution of words is studied for languages. The frequency of words is plotted against the
rank. Here the total number of occurrences of a particular word is termed its frequency. The most
frequent word has rank 1, the next most frequent has rank 2, and so on. For natural languages, the plot
gives the Zipf [8] behaviour:

$$ f_N = \frac{f_1}{N} \tag{1} $$

where N stands for the rank and f_1 and f_N are the frequencies of words of rank 1 and N respectively.
The Zipf-type approach to the study of DNA has brought methods of statistical linguistics into DNA
analysis [6]. The generalized Zipf distribution of n-tuples has provided hints that the DNA sequences
may have some structural features common to languages. In this work we confine ourselves to the 
distribution of 3-tuples, the codons, in the CDS. The words, therefore, are non-overlapping and on 
the single reading frame. 
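
The rank ordering described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the paper: the function name `codon_rank_frequency`, the choice of reading frame 0, the alphabetical tie-breaking and the toy sequence are all assumptions made for the example.

```python
from collections import Counter

def codon_rank_frequency(cds):
    """Return [(rank, codon, frequency), ...] for non-overlapping codons in frame 0."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # drop an incomplete trailing codon, if any
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    # Sort by descending frequency; ties are broken alphabetically (an arbitrary choice).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, codon, freq) for rank, (codon, freq) in enumerate(ordered, start=1)]

toy = "ATGGCTGCTAAAGCTATGTAA"  # invented toy sequence, not a real GAPDH CDS
table = codon_rank_frequency(toy)
for rank, codon, freq in table:
    print(rank, codon, freq)
```

The frequencies sum to the number of codon words in the sequence, which is the quantity the rank-1 frequency is later compared against.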

The frequency-vs-rank plots of the codon words show that these distributions, given the frequency
of rank 1 and the length of the sequence, are almost completely defined through the universal
exponential functional form [9]:

$$ f_N = f_1 \, e^{-\beta (N-1)} \tag{2} $$

The parameter, called β, is determined by the ratio

$$ \beta = \frac{f_1}{L} \tag{3} $$

where L is the length of the sequence.



β measures the frequency of rank 1 per unit length of the sequence. The exponential form (2) is
to be compared to the usual Boltzmann distribution. The rank of the word is akin to energy; β is
analogous to inverse temperature. The relationship (3), that β is the frequency of rank 1 per unit length,
is well supported by data [9]. The analogy between word distributions and the classical Boltzmann
concepts goes deeper. A decrease in β, from (3), implies that the frequency of rank 1 per unit length goes
down. In that case the vocabulary clearly increases. More words are used, thereby more states are
accessed. For the GAPDH CDS we find that evolution is driving it to higher temperatures: into more
freedom for words, into more randomisation. β evolves monotonically.
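
A short consistency check, offered as an aside rather than as the derivation of [9], suggests why the exponential form and the ratio for β go together. It assumes that the length L is counted in codon words, so that the frequencies sum to L, and that N_max (a symbol introduced here) is the number of distinct codons used, at most 64.

```latex
\sum_{N=1}^{N_{\max}} f_N
  = f_1 \sum_{N=1}^{N_{\max}} e^{-\beta (N-1)}
  = f_1 \, \frac{1 - e^{-\beta N_{\max}}}{1 - e^{-\beta}}
  \approx \frac{f_1}{\beta}
  \qquad (\beta \ll 1,\ \beta N_{\max} \gg 1)
```

Setting the sum equal to L then gives β ≈ f_1/L. The relation is approximate under these assumptions, since the two limiting conditions hold only roughly for a codon vocabulary of at most 64 words.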

Underneath, however, there runs a curious counterflow. Suppose we look into the nucleotides
that constitute the sequence, once again in windows of size 3 and in the same reading frame. First, 
we ask how much order there is in the sequence. To find out we study the second moments of the 
letters A, C, G and T. These second moments, by themselves, do not produce any pattern. The 
GAPDH CDS has about 1000 bases. For each organism the proportions of A, C, G and T in the 
GAPDH CDS are different. This strand-bias, interestingly, masks a remarkable underlying trend. 

To get there the strand-bias has to be eliminated. The order in the sequence, we assume, is its
deviation from the random. We define the quantity X, a measure of this deviation, as follows:

$$ X = \frac{\text{Second moment of the base distribution in the GAPDH CDS}}{\text{Second moment of the base distribution in the random sequence with identical strand bias}} $$

Normalised as above, the trend masked by the strand bias is revealed. X values of GAPDH change
monotonically with evolution. The data tell us there is an increase in persistence amongst the letters (in
windows of size 3) with evolution in the CDS [10].

The evolution in the GAPDH CDS is then the result of these two contra trends: while words 
acquire greater uniformisation, the underlying letters have more order. The monotonic behaviours 
of β and X with evolution give us the physics behind the biological classification of bacteria.



Methods 

Word Distributions 

For the codons it is known [9] that exponentials give somewhat better fits than the usual power laws.
The exponential form, equation (2), is characterized by the parameter β. The quantity has some
universal features in that it is almost completely determined by f_1 and the length of the CDS. The
relationship [9]

$$ \beta = \frac{f_1}{L} \tag{4} $$

where L is the length of the CDS, is known to fit observations on diverse genes. For the bacterial
GAPDH CDS the β values are given in Table 2.
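
As a rough illustration of how β might be estimated and the exponential form compared with observed counts, the sketch below computes the ranked codon frequencies, takes β as the rank-1 frequency divided by the number of codon words, and evaluates equation (2) at each rank. Counting the length in codon words is an assumption of this sketch; the function names and the toy sequence are hypothetical.

```python
import math
from collections import Counter

def ranked_codon_frequencies(cds):
    """Codon counts in frame 0, sorted from most to least frequent."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # ignore an incomplete trailing codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    return sorted(counts.values(), reverse=True)

def beta_estimate(ranked):
    """Rank-1 frequency per unit length; length is taken as the codon count (assumption)."""
    return ranked[0] / sum(ranked)

def exponential_prediction(ranked):
    """Frequencies predicted by f_N = f_1 * exp(-beta * (N - 1)) for each observed rank."""
    beta = beta_estimate(ranked)
    f1 = ranked[0]
    return [f1 * math.exp(-beta * (n - 1)) for n in range(1, len(ranked) + 1)]

toy = "ATGGCTGCTAAAGCTATGTAA"  # invented toy sequence, far shorter than a real CDS
ranked = ranked_codon_frequencies(toy)
beta = beta_estimate(ranked)
predicted = exponential_prediction(ranked)
for n, (obs, pred) in enumerate(zip(ranked, predicted), start=1):
    print(n, obs, round(pred, 3))
```

A sequence this short is only useful for checking the mechanics; any comparison of fit quality would need a full-length CDS.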

Moments 

Consider the 4-dimensional walk model [11,12] such that A, C, G and T correspond to unit steps, in
the positive direction, along the x_A, x_C, x_G and x_T axes. If after n steps the co-ordinate of the walker
is (n_A, n_C, n_G, n_T), then, clearly,

$$ n = n_A + n_C + n_G + n_T \tag{5} $$

and n_i (i = A, C, G, T) is the number of nucleotides of type i in the sequence just walked.

If the sequence has n bases, and n_i is the number of bases of type i, the strand bias of the sequence
is the proportion of n_i in n, defined as

$$ p_i = \frac{n_i}{n} \tag{6} $$

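The probability distribution for the single step in this 4-d walk is

$$ P_1(\mathbf{x}) = \sum_i p_i \, \delta(\mathbf{x} - \hat{\mathbf{e}}_i) \tag{7} $$

where δ is the usual δ-function of Dirac and \hat{\mathbf{e}}_i denotes the unit step along the x_i axis.
The characteristic function of the step is the Fourier transform of equation (7),

$$ \tilde{P}_1(\mathbf{k}) = \sum_i p_i \, e^{\mathrm{i} k_i} \tag{8} $$

The characteristic function of l steps is

$$ \tilde{P}_l(\mathbf{k}) = \{ \tilde{P}_1(\mathbf{k}) \}^l \tag{9} $$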

The second moments (i.e. the average values) of the distributions may be obtained by taking derivatives
of the l-step characteristic function (9) with respect to k. Thus for the random sequence (indicated by
the subscript r) with the strand bias (6), we get the average values:

$$ \langle n_i^2 \rangle_r = l \, [(l-1) p_i^2 + p_i] \tag{10} $$

$$ \langle n_i n_j \rangle_r = l (l-1) \, p_i p_j \quad (i \neq j) \tag{11} $$
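
Equations (10) and (11) are the second moments of a multinomial count distribution, and they can be checked numerically without any Fourier analysis. The sketch below enumerates every possible window of l = 3 independent letters, weights each by its probability, and compares the exact averages with the closed forms. The strand-bias values and function names are illustrative assumptions, not data from the paper.

```python
from itertools import product

BASES = "ACGT"

def exact_moments(p, l=3):
    """Exact <n_a n_b> over all windows of l independent letters with probabilities p."""
    moments = {(a, b): 0.0 for a in BASES for b in BASES}
    for window in product(BASES, repeat=l):
        prob = 1.0
        for base in window:
            prob *= p[base]
        counts = {b: window.count(b) for b in BASES}
        for a in BASES:
            for b in BASES:
                moments[(a, b)] += prob * counts[a] * counts[b]
    return moments

def formula_moments(p, l=3):
    """Closed forms: equation (10) on the diagonal, equation (11) off the diagonal."""
    out = {}
    for a in BASES:
        for b in BASES:
            if a == b:
                out[(a, b)] = l * ((l - 1) * p[a] ** 2 + p[a])
            else:
                out[(a, b)] = l * (l - 1) * p[a] * p[b]
    return out

p = {"A": 0.3, "C": 0.2, "G": 0.25, "T": 0.25}  # illustrative strand bias
exact = exact_moments(p)
closed = formula_moments(p)
max_gap = max(abs(exact[key] - closed[key]) for key in exact)
print("largest discrepancy:", max_gap)
```

The discrepancy is at the level of floating-point rounding, which supports the reconstructed forms of (10) and (11).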

We are interested in codons; therefore, the window size l in equations (10) and (11) is chosen to
be 3. For the actual sequences we calculate ⟨n_i^2⟩_seq and ⟨n_i n_j⟩_seq. The quantities

$$ X_D = \frac{\langle n_i^2 \rangle_{seq}}{\langle n_i^2 \rangle_r} \tag{12} $$

[where D = AA, CC, GG, TT]
and

$$ X_{OD} = \frac{\langle n_i n_j \rangle_{seq}}{\langle n_i n_j \rangle_r} \tag{13} $$

[where OD = AC, AG, AT, CG, CT, GT]

measure the deviation of the diagonal and off-diagonal second moments of the sequence from those
of the random sequence of identical strand bias, respectively. Finally, we come up with an over-all
averaged index, X, given by

$$ X = \frac{1}{10} \left[ \sum_{D} X_D + \sum_{OD} X_{OD} \right] \tag{14} $$


This X provides a measure of the order in the sequences. 
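
The index can be sketched in Python as follows. This is one reading of the procedure, under stated assumptions: the windows are taken as non-overlapping triplets in reading frame 0, the strand bias is measured on the same sequence, X is the plain average of the four diagonal and six off-diagonal ratios, and all four bases occur in the sequence (otherwise an expected moment is zero and the ratio is undefined). The function name `order_index` is hypothetical.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"

def order_index(cds, l=3):
    """Return (X, ratios): the averaged index and the ten per-pair moment ratios."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % l  # ignore an incomplete trailing window
    seq = seq[:usable]
    windows = [seq[i:i + l] for i in range(0, usable, l)]
    # Strand bias: proportion of each base in the sequence.
    p = {b: seq.count(b) / usable for b in BASES}
    ratios = {}
    for a, b in combinations_with_replacement(BASES, 2):
        # Observed second moment, averaged over windows.
        observed = sum(w.count(a) * w.count(b) for w in windows) / len(windows)
        if a == b:
            expected = l * ((l - 1) * p[a] ** 2 + p[a])   # random, diagonal
        else:
            expected = l * (l - 1) * p[a] * p[b]          # random, off-diagonal
        ratios[a + b] = observed / expected
    return sum(ratios.values()) / len(ratios), ratios

x_value, pair_ratios = order_index("AAACCCGGGTTT")  # invented toy sequence
print(x_value)
print(pair_ratios)
```

For this toy input every window holds a single repeated letter, so each diagonal ratio is 2 and each off-diagonal ratio is 0. Real coding sequences sit much closer to 1, as the tables show.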

Observations and Results

To set the basis for what we discuss later, we begin by recording the β and the X values of higher
organisms, the eukaryotes (Table 1). We confine our discussion of the eukaryotes to three broad
categories: fungi, invertebrates and vertebrates. It is known [13] from fossil records that the oldest fungi
appeared about 900 million years (Myr) before present (bp). The oldest fungal species, identified with
certainty, are from the Ordovician period, i.e., some 500 Myr bp. The fossil records of invertebrates
suggest this group appeared about the same time as the fungi. The vertebrates came later, about 400
Myr bp, in the late Ordovician and Silurian periods.

Let us look at the β and the X values of these eukaryotic groups. Fungi have the highest values, followed
by invertebrates, while for the vertebrates the β and the X reach minima. We conclude that the β and
the X decrease with evolution. The data further suggest that fungi and invertebrates appeared about the
same time and underwent parallel evolution, while the representatives of the vertebrate group came
later in the evolutionary line-up.

Having set the basis, let us now look at 14 bacterial species from three groups: cyanobacteria,
proteobacteria (which include the vast majority of gram-negative bacteria), and the Bacillus/Clostridium
group, a type of gram-positive bacteria. Table 2 summarises the β and the X values of these samples.
These bacterial groups arose during the Precambrian period of the geological time-scale, but there are
several schools of thought regarding their specific times of origin within this period.

We approach the bacterial GAPDH CDS with two differing statistical measures, the β and the
X. Interestingly, both give us almost identical trends (Figs. 1 and 2). Lactobacillus delbrueckii,
a member of the Bacillus/Clostridium group, has the highest β and X values (Table 2). There
is then a large measure of overlap between the Bacillus/Clostridium group and the proteobacteria
(Figs. 1 and 2). The extent of overlap of the β values is somewhat more than that of the X. The
cyanobacterial samples have the minimum values of the β and the X. There is no overlap between the
cyanobacterial values of the β and the X and those of the Bacillus/Clostridium group. The overlap between
the proteobacteria and the cyanobacteria is small. Only one proteobacterial sample, Brucella abortus,
has a lower β value than the cyanobacterial member Synechocystis sp. (strain PCC 6803) (Table 2).

The averages of the β and the X have their maximum values in the Bacillus/Clostridium group,
followed by the proteobacteria, while the cyanobacterial samples have the lowest values (Table 3). In line with
our observations on the eukaryotes, we propose (Figs. 1 and 2) that the Bacillus/Clostridium group
originated some time before the proteobacterial species, but later both groups evolved in parallel.
The cyanobacterial samples are of recent origin compared to these groups. The trends in the β and
the X give us identical patterns that segregate the bacterial species into groups. Amusingly, the
results seem to be in agreement with what is accepted so far regarding the phylogenetic relationships
among these three groups [14]. Our study of the GAPDH CDS, its word distributions, and the
moments gives us the physics underlying evolution.

The decrease in β with evolution for the GAPDH CDS tells us that evolution is taking the gene
progressively towards higher temperatures. The β value, we recall, is the frequency of rank one per
unit length. Lowering of the β implies less dominance of the maximum weight. In consequence,
the other words enjoy greater freedom, the vocabulary increases and more states are accessed. In a
sense the evolution in the GAPDH CDS mirrors Boltzmannian statistics. Even though the GAPDH
CDS has evolved in a complex evolutionary regime in contact with the environment, the Boltzmannian
behaviour is useful. For instance, it allows us to define the word-entropy of the CDS. That gives us
a measure of the information content of the words in the biological chain.

At the level of the nucleotide letters A, C, G and T, the order is measured by the quantity X. 
As we look into the diagonal averages X_D, (12), we find they increase with evolution. For the window
of size 3, this growing diagonal moment implies a rising persistent correlation. In consequence, the
off-diagonal averages X_OD, (13), go down, decreasing antipersistence. Looked at from the letters, the
sequences become less uniform and deviate more from the random sequence of identical strand bias. 
The order, or the information, in the arrangement of letters shows a rising trend with evolution. 

Does any CDS that is an evolutionary marker evolve in ways similar to the GAPDH? We have 
worked with the CDS of some other glycolytic enzymes, such as phosphoglycerate kinase, and found
they behave similarly. Other evolutionary markers such as the ribulose-1,5-bisphosphate
carboxylase/oxygenase enzyme large subunit (rbcL) show similar behaviour. We use these data for biological
subclassification. The CDS for ribosomal RNA is another class of sequence that is being investigated. 
It does not code for protein, but for RNA, and has periods other than 3. The 3 period does exist, 
but is not predominant. 

Sequence modeling has recently become important. The fractal correlations in the sequences led
to the expansion-modification system [15]. Later came the insertion models [16]. Here the differences
between the CDS and the non-coding sequences were observed and the non-coding sequences modeled. The
unifying models of copying-mistake maps [17] modeled both the coding and the non-coding parts.
In these models the statistical features of the non-coding sequences have received emphasis. The
evolutionary features of the GAPDH CDS isolate the statistical aspects that underlie evolution in
coding sequences. The statistics of the word distributions and the subtle cross current of the second 
moments, we hope, will lead further in these efforts. 

Figure Legends

Figure 1. The average β values for the GAPDH CDS from three bacterial groups (see Table 3).
The error bars indicate the standard deviation from the average values. 

Figure 2. The average X values for the GAPDH CDS from three bacterial groups (see Table 3). The 
error bars indicate the standard deviation from the average values. 

Table 1: The average β and X values of GAPDH CDS for eukaryotic groups, along with
the range of deviations in the respective groups.

Group            β                     X
Vertebrates      0.05398 (±0.00414)    0.99698 (±0.004)
Invertebrates    0.07503 (±0.01067)    1.00235 (±0.00261)
Fungi            0.07742 (±0.00389)    1.00705 (±0.00175)

Table 2: The β and the X values of the GAPDH CDS from the bacterial species that
have been used in our study (source: GenBank and EMBL databases).

Organism                     Accession No.   Group                   β         X
Bacillus megaterium          M87647          Bacillus/Clostridium    0.07662   1.01185
Bacillus subtilis            X13011          Bacillus/Clostridium    0.07431   1.00912
Clostridium pasteurianum     X72219          Bacillus/Clostridium    0.07837   1.00483
Lactobacillus delbrueckii    AJ000339        Bacillus/Clostridium    0.08529   1.01861
Lactococcus lactis           L36907          Bacillus/Clostridium    0.06038   1.00245
Pseudomonas aeruginosa       M74256          Proteobacteria          0.08166   1.00338
Escherichia coli             X02662          Proteobacteria          0.08366   1.00447
Brucella abortus             AF095338        Proteobacteria          0.05713   1.00604
Zymomonas mobilis            M18802          Proteobacteria          0.07721   1.00457
Rhodobacter sphaeroides      M68914          Proteobacteria          0.06539   1.00564
Xanthobacter flavus          U33064          Proteobacteria          0.06839   1.00086
Anabaena variabilis          L07498          Cyanobacteria           0.04547   1.00073
Synechococcus PCC 7942       X91236          Cyanobacteria           0.05025   0.99988
Synechocystis PCC 6803       X83564          Cyanobacteria           0.06043   0.99101

Table 3: The average β and X values of the GAPDH CDS for the three bacterial groups,
along with the range of deviations in the respective groups.

Group                   β                     X
Bacillus/Clostridium    0.07499 (±0.00914)    1.00937 (±0.00633)
Proteobacteria          0.07224 (±0.01033)    1.00416 (±0.00187)
Cyanobacteria           0.05205 (±0.00764)    0.99721 (±0.00538)
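
The group averages and deviations can be recomputed from the per-species values in Table 2. The sketch below does so with the Python standard library; it assumes that the ± entries are sample standard deviations (n − 1 denominator), which is the convention that matches the tabulated numbers. The variable and function names are illustrative.

```python
from statistics import mean, stdev

# (beta, X) pairs per species, copied from Table 2.
TABLE_2 = {
    "Bacillus/Clostridium": [
        (0.07662, 1.01185), (0.07431, 1.00912), (0.07837, 1.00483),
        (0.08529, 1.01861), (0.06038, 1.00245),
    ],
    "Proteobacteria": [
        (0.08166, 1.00338), (0.08366, 1.00447), (0.05713, 1.00604),
        (0.07721, 1.00457), (0.06539, 1.00564), (0.06839, 1.00086),
    ],
    "Cyanobacteria": [
        (0.04547, 1.00073), (0.05025, 0.99988), (0.06043, 0.99101),
    ],
}

def summarise(table):
    """Mean and sample standard deviation of beta and X for each group."""
    summary = {}
    for group, rows in table.items():
        betas = [b for b, _ in rows]
        xs = [x for _, x in rows]
        summary[group] = {
            "beta": (mean(betas), stdev(betas)),
            "X": (mean(xs), stdev(xs)),
        }
    return summary

summary = summarise(TABLE_2)
for group, stats in summary.items():
    print(f"{group}: beta = {stats['beta'][0]:.5f} (±{stats['beta'][1]:.5f}), "
          f"X = {stats['X'][0]:.5f} (±{stats['X'][1]:.5f})")
```

With the sample standard deviation the output agrees with the Table 3 entries to within rounding, which also confirms that the two tables are mutually consistent.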

[Figure 1 plot: vertical axis β, ticks from 0.040 to 0.080; horizontal axis groups: 1 - Bacillus/Clostridium group, 2 - Proteobacteria, 3 - Cyanobacteria]